ggplot2::geom_dotplot

Function of the Week

Plot the distribution or density of a variable for your category of interest
Author

Amanda Zucker

Published

February 13, 2025

1 ggplot2::geom_dotplot

In this document, I will introduce the geom_dotplot() function and show what it’s for.

# load the tidyverse
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readxl)
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(ggplot2)

NS_PDAC <- read_excel(here::here("function_week", "data", "NS_PDAC.xlsx"),
                      sheet = 2,
                      skip = 1,
                      na = c("N/A", "?"))
glimpse(NS_PDAC)
Rows: 78
Columns: 28
$ `Patient ID`                                 <chr> "8839995", "8839996", "88…
$ Sex                                          <chr> NA, NA, NA, "M", "F", "F"…
$ Age                                          <dbl> NA, NA, NA, 67, 53, 62, 6…
$ BMI                                          <dbl> NA, NA, NA, 26.0, 30.0, 2…
$ `Disease State`                              <chr> "Healthy", "Healthy", "He…
$ Stage                                        <chr> NA, NA, NA, "Metastatic",…
$ Grade                                        <chr> NA, NA, NA, NA, NA, NA, "…
$ `Collected on Surgery Day?`                  <chr> "No", "No", "No", "No", "…
$ `Pre-Surgical`                               <chr> NA, NA, NA, NA, NA, "No",…
$ `s/p Surgery`                                <chr> "No", "No", "No", "Yes, P…
$ Metastatic                                   <chr> NA, NA, NA, "Yes", "Yes",…
$ `Lymphovascular Invasion (surgical samples)` <chr> NA, NA, NA, NA, NA, NA, "…
$ `Perineural Invasion (surgical samples)`     <chr> NA, NA, NA, NA, NA, NA, "…
$ Treatment                                    <chr> NA, NA, NA, "s/p chemo", …
$ `Prior Treatment`                            <chr> NA, NA, NA, NA, "mFOLFIRI…
$ `Current treatment Regimen`                  <chr> NA, NA, NA, "Gem/Abraxane…
$ Radiation                                    <chr> NA, NA, NA, "No", "No", "…
$ Timepoint                                    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ `Disease Years`                              <chr> NA, NA, NA, "3", "2", "2"…
$ `CA19-9 Initial`                             <chr> NA, NA, NA, "89", "1384",…
$ `CA19-9 Current`                             <chr> NA, NA, NA, "2062", "152"…
$ `Associated IPMN?`                           <chr> NA, NA, NA, "No", "no", "…
$ `History of Pancreatitis`                    <chr> NA, NA, NA, "No", "No", "…
$ `Hx diabetes (before diagnosis)`             <chr> NA, NA, NA, "Yes", "No", …
$ `Smoking (pack years)`                       <chr> NA, NA, NA, "0", "25, qui…
$ `Current Alcohol Use`                        <chr> NA, NA, NA, "Yes", "No", …
$ `Prior Cancer History`                       <chr> NA, NA, NA, "No", "No", "…
$ `Family Hx Cancer`                           <chr> NA, NA, NA, "Mother - pan…
NS_PDAC <- clean_names(NS_PDAC)

NS_PDAC <- NS_PDAC %>%
  mutate(across(.cols = c(disease_state, stage, collected_on_surgery_day, history_of_pancreatitis),
                .fns = as.factor))
NS_PDAC <- NS_PDAC %>%
  mutate(across(.cols = c(ca19_9_initial, ca19_9_current),
                .fns = as.numeric))
Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(.cols = c(ca19_9_initial, ca19_9_current), .fns =
  as.numeric)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
glimpse(NS_PDAC)
Rows: 78
Columns: 28
$ patient_id                               <chr> "8839995", "8839996", "883999…
$ sex                                      <chr> NA, NA, NA, "M", "F", "F", "M…
$ age                                      <dbl> NA, NA, NA, 67, 53, 62, 66, 6…
$ bmi                                      <dbl> NA, NA, NA, 26.0, 30.0, 29.0,…
$ disease_state                            <fct> "Healthy", "Healthy", "Health…
$ stage                                    <fct> NA, NA, NA, "Metastatic", "Me…
$ grade                                    <chr> NA, NA, NA, NA, NA, NA, "Mode…
$ collected_on_surgery_day                 <fct> No, No, No, No, No, No, No, N…
$ pre_surgical                             <chr> NA, NA, NA, NA, NA, "No", "No…
$ s_p_surgery                              <chr> "No", "No", "No", "Yes, Palli…
$ metastatic                               <chr> NA, NA, NA, "Yes", "Yes", "Ye…
$ lymphovascular_invasion_surgical_samples <chr> NA, NA, NA, NA, NA, NA, "Yes"…
$ perineural_invasion_surgical_samples     <chr> NA, NA, NA, NA, NA, NA, "Yes"…
$ treatment                                <chr> NA, NA, NA, "s/p chemo", "s/p…
$ prior_treatment                          <chr> NA, NA, NA, NA, "mFOLFIRINOX"…
$ current_treatment_regimen                <chr> NA, NA, NA, "Gem/Abraxane", "…
$ radiation                                <chr> NA, NA, NA, "No", "No", "s/p …
$ timepoint                                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ disease_years                            <chr> NA, NA, NA, "3", "2", "2", "2…
$ ca19_9_initial                           <dbl> NA, NA, NA, 89, 1384, 16, 310…
$ ca19_9_current                           <dbl> NA, NA, NA, 2062, 152, 54, 16…
$ associated_ipmn                          <chr> NA, NA, NA, "No", "no", "No",…
$ history_of_pancreatitis                  <fct> NA, NA, NA, "No", "No", "No",…
$ hx_diabetes_before_diagnosis             <chr> NA, NA, NA, "Yes", "No", "No"…
$ smoking_pack_years                       <chr> NA, NA, NA, "0", "25, quit 20…
$ current_alcohol_use                      <chr> NA, NA, NA, "Yes", "No", "No"…
$ prior_cancer_history                     <chr> NA, NA, NA, "No", "No", "No",…
$ family_hx_cancer                         <chr> NA, NA, NA, "Mother - pancrea…
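
The "NAs introduced by coercion" warning above arises because as.numeric() turns any string it cannot parse into NA. Here is a minimal sketch of how one might check which entries are affected before converting. The vector ca19_9_raw is made-up illustration data, not values from NS_PDAC.

```r
# Hypothetical character values standing in for a lab-result column
ca19_9_raw <- c("89", "1384", "<1", NA, "16")

# Convert, silencing the expected "NAs introduced by coercion" warning
ca19_9_num <- suppressWarnings(as.numeric(ca19_9_raw))

# Entries that were present as text but could not be parsed as numbers
failed <- ca19_9_raw[!is.na(ca19_9_raw) & is.na(ca19_9_num)]
failed
```

Inspecting the failed entries first makes it easier to decide whether they should really become NA or be cleaned by hand.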

1.1 What is it for?

geom_dotplot() can be used to plot the distribution or density of a variable for your category of interest. The dots can be stacked vertically or horizontally.

ggplot(NS_PDAC,
       aes(x = age)) +
  geom_dotplot()
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.
Warning: Removed 3 rows containing missing values or values outside the scale range
(`stat_bindot()`).

ggplot(NS_PDAC,
       aes(x = age, fill = disease_state)) +
  geom_dotplot() +
  theme(legend.position = "bottom")
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.
Warning: Removed 3 rows containing missing values or values outside the scale range
(`stat_bindot()`).

NS_PDAC <- NS_PDAC %>%
  mutate(CA19_9_Level = case_when(
    ca19_9_initial <= 100 ~ "0-100",
    (ca19_9_initial > 100 & ca19_9_initial <= 400) ~ "99-400",
    (ca19_9_initial > 400 & ca19_9_initial <= 800) ~ "401-800",
    ca19_9_initial > 800 ~ "801+"
  ))

glimpse(NS_PDAC)
Rows: 78
Columns: 29
$ patient_id                               <chr> "8839995", "8839996", "883999…
$ sex                                      <chr> NA, NA, NA, "M", "F", "F", "M…
$ age                                      <dbl> NA, NA, NA, 67, 53, 62, 66, 6…
$ bmi                                      <dbl> NA, NA, NA, 26.0, 30.0, 29.0,…
$ disease_state                            <fct> "Healthy", "Healthy", "Health…
$ stage                                    <fct> NA, NA, NA, "Metastatic", "Me…
$ grade                                    <chr> NA, NA, NA, NA, NA, NA, "Mode…
$ collected_on_surgery_day                 <fct> No, No, No, No, No, No, No, N…
$ pre_surgical                             <chr> NA, NA, NA, NA, NA, "No", "No…
$ s_p_surgery                              <chr> "No", "No", "No", "Yes, Palli…
$ metastatic                               <chr> NA, NA, NA, "Yes", "Yes", "Ye…
$ lymphovascular_invasion_surgical_samples <chr> NA, NA, NA, NA, NA, NA, "Yes"…
$ perineural_invasion_surgical_samples     <chr> NA, NA, NA, NA, NA, NA, "Yes"…
$ treatment                                <chr> NA, NA, NA, "s/p chemo", "s/p…
$ prior_treatment                          <chr> NA, NA, NA, NA, "mFOLFIRINOX"…
$ current_treatment_regimen                <chr> NA, NA, NA, "Gem/Abraxane", "…
$ radiation                                <chr> NA, NA, NA, "No", "No", "s/p …
$ timepoint                                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ disease_years                            <chr> NA, NA, NA, "3", "2", "2", "2…
$ ca19_9_initial                           <dbl> NA, NA, NA, 89, 1384, 16, 310…
$ ca19_9_current                           <dbl> NA, NA, NA, 2062, 152, 54, 16…
$ associated_ipmn                          <chr> NA, NA, NA, "No", "no", "No",…
$ history_of_pancreatitis                  <fct> NA, NA, NA, "No", "No", "No",…
$ hx_diabetes_before_diagnosis             <chr> NA, NA, NA, "Yes", "No", "No"…
$ smoking_pack_years                       <chr> NA, NA, NA, "0", "25, quit 20…
$ current_alcohol_use                      <chr> NA, NA, NA, "Yes", "No", "No"…
$ prior_cancer_history                     <chr> NA, NA, NA, "No", "No", "No",…
$ family_hx_cancer                         <chr> NA, NA, NA, "Mother - pancrea…
$ CA19_9_Level                             <chr> NA, NA, NA, "0-100", "801+", …
ggplot(NS_PDAC,
       aes(x = CA19_9_Level)) +
  geom_dotplot()
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.

# binwidth sets the width of the bins along the binned axis, which also determines the dot diameter (larger value = bigger dots and more observations per stack, smaller value = smaller dots)

# I changed the binwidth, which fit the dots fully on the graph, but the x-axis categories are still not in the right order because CA19_9_Level is a character, not a factor

ggplot(NS_PDAC,
       aes(x = CA19_9_Level)) +
  geom_dotplot(binwidth = 0.08) 
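
To see what binwidth does on a continuous variable, here is a small sketch using the built-in mtcars data (not the NS_PDAC data). It assumes the default dot-density binning, where binwidth is measured in the units of the binned axis and dotsize rescales the dots relative to that width.

```r
library(ggplot2)

# Narrow bins: small dots, few observations share a stack
p_narrow <- ggplot(mtcars, aes(x = mpg)) +
  geom_dotplot(binwidth = 0.5)

# Wide bins: larger dots, more observations fall into the same stack
p_wide <- ggplot(mtcars, aes(x = mpg)) +
  geom_dotplot(binwidth = 2)

# Same bins as p_wide, but dots drawn at half the bin width
p_wide_small <- ggplot(mtcars, aes(x = mpg)) +
  geom_dotplot(binwidth = 2, dotsize = 0.5)

p_narrow
p_wide
p_wide_small
```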

NS_PDAC$CA19_9_Level <- factor(NS_PDAC$CA19_9_Level)
glimpse(NS_PDAC)
Rows: 78
Columns: 29
$ patient_id                               <chr> "8839995", "8839996", "883999…
$ sex                                      <chr> NA, NA, NA, "M", "F", "F", "M…
$ age                                      <dbl> NA, NA, NA, 67, 53, 62, 66, 6…
$ bmi                                      <dbl> NA, NA, NA, 26.0, 30.0, 29.0,…
$ disease_state                            <fct> "Healthy", "Healthy", "Health…
$ stage                                    <fct> NA, NA, NA, "Metastatic", "Me…
$ grade                                    <chr> NA, NA, NA, NA, NA, NA, "Mode…
$ collected_on_surgery_day                 <fct> No, No, No, No, No, No, No, N…
$ pre_surgical                             <chr> NA, NA, NA, NA, NA, "No", "No…
$ s_p_surgery                              <chr> "No", "No", "No", "Yes, Palli…
$ metastatic                               <chr> NA, NA, NA, "Yes", "Yes", "Ye…
$ lymphovascular_invasion_surgical_samples <chr> NA, NA, NA, NA, NA, NA, "Yes"…
$ perineural_invasion_surgical_samples     <chr> NA, NA, NA, NA, NA, NA, "Yes"…
$ treatment                                <chr> NA, NA, NA, "s/p chemo", "s/p…
$ prior_treatment                          <chr> NA, NA, NA, NA, "mFOLFIRINOX"…
$ current_treatment_regimen                <chr> NA, NA, NA, "Gem/Abraxane", "…
$ radiation                                <chr> NA, NA, NA, "No", "No", "s/p …
$ timepoint                                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ disease_years                            <chr> NA, NA, NA, "3", "2", "2", "2…
$ ca19_9_initial                           <dbl> NA, NA, NA, 89, 1384, 16, 310…
$ ca19_9_current                           <dbl> NA, NA, NA, 2062, 152, 54, 16…
$ associated_ipmn                          <chr> NA, NA, NA, "No", "no", "No",…
$ history_of_pancreatitis                  <fct> NA, NA, NA, "No", "No", "No",…
$ hx_diabetes_before_diagnosis             <chr> NA, NA, NA, "Yes", "No", "No"…
$ smoking_pack_years                       <chr> NA, NA, NA, "0", "25, quit 20…
$ current_alcohol_use                      <chr> NA, NA, NA, "Yes", "No", "No"…
$ prior_cancer_history                     <chr> NA, NA, NA, "No", "No", "No",…
$ family_hx_cancer                         <chr> NA, NA, NA, "Mother - pancrea…
$ CA19_9_Level                             <fct> NA, NA, NA, 0-100, 801+, 0-10…
levels(NS_PDAC$CA19_9_Level) # now CA19_9_Level is a factor, but the levels still need to be reordered
[1] "0-100"   "401-800" "801+"    "99-400" 
NS_PDAC$CA19_9_Level <- factor(NS_PDAC$CA19_9_Level,
                               levels = c("0-100", "99-400","401-800","801+"))
levels(NS_PDAC$CA19_9_Level)
[1] "0-100"   "99-400"  "401-800" "801+"   
ggplot(NS_PDAC,
       aes(x = CA19_9_Level)) +
  geom_dotplot(binwidth = 0.08)
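
An alternative to building the categories with case_when() and re-leveling afterwards might be base R's cut(), which returns a factor whose levels already follow the order of the breaks. The values below are made up, and the labels are illustrative choices rather than the ones used in this post.

```r
# Hypothetical CA19-9 values, not taken from NS_PDAC
ca19_9 <- c(89, 1384, 16, 310, 650, NA)

# Intervals are (lower, upper] by default, so 100 falls in the first bin;
# include.lowest = TRUE also keeps a value of exactly 0
ca19_9_level <- cut(
  ca19_9,
  breaks = c(0, 100, 400, 800, Inf),
  labels = c("0-100", "101-400", "401-800", "801+"),
  include.lowest = TRUE
)

levels(ca19_9_level)
table(ca19_9_level, useNA = "ifany")
```

Because the result is already a factor in break order, it can go straight into aes(x = ...) without a second factor() call.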

ggplot(NS_PDAC,
       aes(x = CA19_9_Level, fill = sex)) + # mapping fill to sex triggers a warning that 3 rows with missing values were removed
  geom_dotplot(binwidth = 0.08)
Warning: Removed 3 rows containing missing values or values outside the scale range
(`stat_bindot()`).

ggplot(NS_PDAC,
       aes(x = CA19_9_Level, fill = disease_state)) +
  geom_dotplot(binwidth = 0.2, alpha = 0.4)

ggplot(NS_PDAC,
       aes(x = CA19_9_Level, fill = disease_state)) +
  geom_dotplot(binwidth = 0.2, alpha = 1, stackdir = "center")

ggplot(NS_PDAC,
       aes(x = CA19_9_Level, fill = disease_state)) +
  geom_dotplot(binwidth = 0.2, alpha = 1, stackdir = "centerwhole")

# setting stackgroups = TRUE stacks the dots across groups instead of overlapping them, so every dot color in the legend is visible
ggplot(NS_PDAC,
       aes(x = CA19_9_Level, fill = disease_state)) +
  geom_dotplot(binwidth = 0.14, alpha = 1, stackdir = "centerwhole", stackgroups = TRUE)
`geom_dotplot()` called with `stackgroups = TRUE` and `method =
"dotdensity"`.", i = "Do you want `binpositions = "all"` instead?

ggplot(NS_PDAC,
       aes(x = CA19_9_Level, fill = disease_state)) +
  geom_dotplot(binwidth = 0.14, alpha = 1, stackdir = "centerwhole", stackgroups = TRUE) +
  scale_fill_viridis_d()
`geom_dotplot()` called with `stackgroups = TRUE` and `method =
"dotdensity"`.", i = "Do you want `binpositions = "all"` instead?

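The message above suggests binpositions = "all". Here is a hedged sketch, on the built-in mtcars data rather than NS_PDAC, of what that looks like for a continuous x. With stackgroups = TRUE, binpositions = "all" determines the bin positions from all of the data taken together, so dots from different fill groups line up and stack on one another.

```r
library(ggplot2)

# Dot-density binning with groups stacked on top of each other;
# binpositions = "all" aligns the bins across the fill groups
p_stacked <- ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_dotplot(
    binwidth = 1,
    stackgroups = TRUE,
    binpositions = "all"
  ) +
  labs(fill = "cyl")

p_stacked
```
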
# You can choose whether the dots are binned along the x-axis (the default) or the y-axis (set binaxis = "y").
# This can be useful for some datasets because, as we can see here, dots from different categories overlap when binned along the x-axis but separate clearly into their respective categories when binned along the y-axis.
ggplot(NS_PDAC, aes(x = hx_diabetes_before_diagnosis, y = CA19_9_Level)) +
  geom_dotplot(stackdir = "center")
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.

ggplot(NS_PDAC, aes(x = hx_diabetes_before_diagnosis, y = CA19_9_Level)) +
  geom_dotplot(binaxis = "y",stackdir = "center")
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.

# You can add summary statistics to the plot using stat_summary(), indicating the statistic you want with fun =
ggplot(NS_PDAC, aes(x = age, y = CA19_9_Level)) +
  geom_dotplot(binaxis = "y", stackdir = "center") +
  stat_summary(fun = mean, geom = "point", color = "red")
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.
Warning: Removed 3 rows containing missing values or values outside the scale range
(`stat_bindot()`).
Warning: Removed 3 rows containing non-finite outside the scale range
(`stat_summary()`).
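
stat_summary() is the ggplot2 function that accepts a fun argument (stat_sum() counts overlapping points instead). Here is a minimal sketch on the built-in mtcars data, assuming a categorical x and a numeric y so that the mean is computed per category:

```r
library(ggplot2)

# Dots binned along y within each cylinder group, with the group mean in red
p_mean <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 1) +
  stat_summary(fun = mean, geom = "point", color = "red", size = 3) +
  labs(x = "cyl")

p_mean
```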

# You can use coord_flip() to swap the x and y axes; it can also be added to a plot that is saved as an object.
# In this example, the binwidth needs adjusting after flipping the axes so that all of the data are visible.
ggplot(NS_PDAC, aes(x = hx_diabetes_before_diagnosis, y = CA19_9_Level)) +
  geom_dotplot(binaxis = "y",stackdir = "center") +
  coord_flip()
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.

ggplot(NS_PDAC, aes(x = hx_diabetes_before_diagnosis, y = CA19_9_Level)) +
  geom_dotplot(binaxis = "y",stackdir = "center",binwidth = 0.06) +
  coord_flip()
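
Adding coord_flip() to a saved plot might look like the sketch below. It uses the built-in mtcars data, and the object names p and p_flipped are illustrative only.

```r
library(ggplot2)

# Save the base plot once
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 1) +
  labs(x = "cyl")

# Flip the axes without rebuilding the plot from scratch
p_flipped <- p + coord_flip()

p
p_flipped
```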

1.2 Is it helpful?

Discuss whether you think this function is useful for you and your work. Is it the best thing since sliced bread, or is it not really relevant to your work?

It can be helpful, but it definitely depends on what you want to plot. If you have specific variables that you want on both the x and y axes, this will be a less useful function, because by default you specify the variable for the x-axis but not one for the y-axis. For example, I could not plot CA19-9 levels by disease state category; this would be done more effectively with geom_point().